Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor readers to use type parameters and not concrete vtables #207

Closed

Conversation

cosmicexplorer
Copy link
Contributor

@cosmicexplorer cosmicexplorer commented Jul 6, 2024

Problem

As described in gyscos/zstd-rs#288 (comment), our usage of &'a mut dyn Read in ZipFileReader stops us from being able to impl Send or Sync, or generally to send a ZipFile into a separate thread than the one owning the ZipArchive it belongs to. While this is generally not an issue, as the single synchronous Read + Seek handle makes it difficult to farm out work to subthreads, for the purposes of #72, we would like to be able to reuse some code instead of creating an entirely separate implementation.

That brings us to the second set of issues, surrounding zipcrypto support:

  • all of our readers go through the CryptoReader enum and the make_crypto_reader() method, even though zip crypto is a very uncommon use case,
  • ZipFile itself contains some very complex mutable state with both a ZipFileReader and a CryptoReader, even though ZipFileReader itself also contains a CryptoReader,
  • Crc32Reader has to maintain a special flag to disable CRC checking in the special case of AE2 encryption, which is a sign of a leaky abstraction and may easily introduce a failure to check CRC in other cases by accident in the future.

Solution

Luckily, all of this becomes significantly easier without too many changes:

  • Create src/unstable/read.rs to reimplement read::ZipFile with the new unstable::read::ZipEntry.
  • Make EntryReader and ZipEntry with a parameterized reader type R instead of an internal &'a mut dyn Read.
  • Make new versions of e.g. ZipArchive::by_index() returning ZipResult<ZipEntry<'_, impl Read + '_>> to avoid leaking internal structs while retaining the ability to rearrange the layout of our internal structs.
  • Vastly improve readability of CryptoReader creation through CryptoVariant.

This also leads to a significantly improved refactoring of streaming support in the streaming submodule of src/unstable/read.rs.

Result

zipcrypto support doesn't muck up the non-crypto methods, ZipEntry is now Send, and src/unstable/read.rs is significantly easier to read with equivalent functionality. This will have to be a breaking change in order to achieve Send-ability, but I believe the readability/maintainability is substantially improved with this change.

src/read.rs Dismissed Show dismissed Hide dismissed
let mut data = ZipFileData::from_local_block(block, reader)?;

match parse_extra_field(&mut data) {
/* FIXME: check for the right error type here instead of accepting any old i/o

Check notice

Code scanning / devskim

A "TODO" or similar was left in source code, possibly indicating incomplete functionality Note

Suspicious comment
// We can't use the typical ::parse() method, as we follow separate code paths depending
// on the "magic" value (since the magic value will be from the central directory header
// if we've finished iterating over all the actual files).
/* TODO: smallvec? */

Check notice

Code scanning / devskim

A "TODO" or similar was left in source code, possibly indicating incomplete functionality Note

Suspicious comment
src/unstable/read.rs Dismissed Show dismissed Hide dismissed
}
if let Some(info) = data.aes_mode {
#[cfg(not(feature = "aes-crypto"))]
/* TODO: make this into its own EntryReadError error type! */

Check notice

Code scanning / devskim

A "TODO" or similar was left in source code, possibly indicating incomplete functionality Note

Suspicious comment
src/unstable/read.rs Dismissed Show dismissed Hide dismissed
src/unstable/read.rs Dismissed Show dismissed Hide dismissed
src/crc32.rs Dismissed Show dismissed Hide dismissed
src/unstable/read.rs Dismissed Show dismissed Hide dismissed
if self.check == res {
Ok(())
} else {
/* TODO: make this into our own Crc32Error error type! */

Check notice

Code scanning / devskim

A "TODO" or similar was left in source code, possibly indicating incomplete functionality Note

Suspicious comment
- also use git dep for zstd to pull in removal of R: BufRead bound
- move new methods into unstable/read.rs
- support getting aes verification info from a raw ZipEntry
- flesh out other methods from ZipFile -> ZipEntry
- now create a streaming abstraction
- move some methods around
src/crc32.rs Outdated
return self
.check_matches()
.map(|()| 0)
/* TODO: use io::Error::other for MSRV >=1.74 */

Check notice

Code scanning / devskim

A "TODO" or similar was left in source code, possibly indicating incomplete functionality Note

Suspicious comment
src/crc32.rs Outdated
if self.check == res {
Ok(())
} else {
/* TODO: make this into our own Crc32Error error type! */

Check notice

Code scanning / devskim

A "TODO" or similar was left in source code, possibly indicating incomplete functionality Note

Suspicious comment
src/crc32.rs Outdated
if self.check == res {
Ok(())
} else {
/* TODO: make this into our own Crc32Error error type! */

Check notice

Code scanning / devskim

A "TODO" or similar was left in source code, possibly indicating incomplete functionality Note

Suspicious comment
src/crc32.rs Outdated
if self.check == res {
Ok(())
} else {
/* TODO: make this into our own Crc32Error error type! */

Check notice

Code scanning / devskim

A "TODO" or similar was left in source code, possibly indicating incomplete functionality Note

Suspicious comment
@cosmicexplorer cosmicexplorer marked this pull request as ready for review July 16, 2024 23:10
@cosmicexplorer cosmicexplorer added major This >~1000 SLOC PR is likely to cause hard-to-resolve conflicts with other open PRs once merged. rust Pull requests that update Rust code labels Jul 16, 2024
src/crc32.rs Dismissed Show dismissed Hide dismissed
Copy link
Member

@Pr0methean Pr0methean left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a partial review.

Cargo.toml Show resolved Hide resolved
src/crc32.rs Show resolved Hide resolved
src/read.rs Show resolved Hide resolved
src/read.rs Show resolved Hide resolved
src/read.rs Show resolved Hide resolved
@@ -0,0 +1,751 @@
//! Alternate implementation of [`crate::read`].
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Document when to use this and when to use the main impl. Also, please remove everything we're not using internally -- we don't need a second public API.

use crate::unstable::read::{
construct_decompressing_reader, find_entry_content_range, CryptoEntryReader, CryptoVariant,
};
pub use crate::unstable::read::{ArchiveEntry, ZipEntry};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't need to be public.

Suggested change
pub use crate::unstable::read::{ArchiveEntry, ZipEntry};
pub(crate) use crate::unstable::read::{ArchiveEntry, ZipEntry};

src/unstable/read.rs Show resolved Hide resolved
src/unstable/read.rs Show resolved Hide resolved
src/unstable/read.rs Show resolved Hide resolved
@Pr0methean Pr0methean added the Please address review comments Some review comments are still open. label Jul 19, 2024
@Pr0methean Pr0methean added the Please fix rebase conflicts Please rebase this PR against master and fix the conflicts. label Jul 27, 2024
@Pr0methean
Copy link
Member

https://github.com/zip-rs/zip2/tree/wrapped-reader-refactor is a squashed and rebased version of this PR, but it doesn't build (mainly due to type mismatches involving BufRead).

@Pr0methean
Copy link
Member

Pr0methean commented Aug 3, 2024

Update: the branch now passes unit tests, and fuzz tests are running on it; but the new code specialized for seekable files is currently unused. @cosmicexplorer Is this a good place for you to resume work from?

@Pr0methean Pr0methean removed the Please fix rebase conflicts Please rebase this PR against master and fix the conflicts. label Aug 3, 2024
@cosmicexplorer
Copy link
Contributor Author

You rock!!! I have just rebased, will address your comments now.

@cosmicexplorer
Copy link
Contributor Author

So @Pr0methean I'd like to raise a small discussion here: the reason I initially made this a draft PR is because I believe there are several shortcomings that necessitate a slight reimplementation, some requiring breaking API changes:

  1. propagating Send and/or Sync bounds from R to ZipFile<R> is literally impossible with the current public API that type-erases to dyn Read.
    • We could make two separate copies, one with dyn Read + Send, but that would be a lot of trouble and not worth it.
    • We could just assert that R is Send and always carry a dyn Read + Send, but that would be unsound.
  2. ZipFile has a lot of stateful logic, including a ZipFileReader::NoReader state, even necessitating an invalid_state() helper method for error checking.
    • Notably (and imo inexcusably), ZipFile has a Drop impl which is only used when reading stream archives (to ensure we read out the rest of the entry from the stream), but this niche use case necessitates every reader struct in the crate to provide an .into_inner() method.
  3. To underline the reason this Drop impl for ZipFile is a bad idea, there is currently a bug in ZipStreamReader which drops the first ZipStreamFileMetadata entry, because read_zipfile_from_stream() does not save the contents of the last ZipLocalEntryBlock it reads which has spec::Magic::CENTRAL_DIRECTORY_HEADER_SIGNATURE.
    • Basically, streaming requires special handling, not being bolted onto ZipFile.
  4. Because streaming requires special handling and separate structs for entries, we also need to be able to get access to all the methods for accessing ZipFileData that we have on ZipFile, but for other structs. That's why this change introduces EntryData, which implements all the methods traditionally provided as member functions of ZipFile.
    • This means we don't need to go through the ZipStreamVisitor interface, and we can have direct control over when to iterate through each entry of the streaming zip file. That's what zip::unstable::read::streaming provides.
    • Importantly, StreamingArchive maintains the internal state necessary to retain the bytes of the first metadata entry after we go through all the local file entries, fixing the bug in ZipStreamReader which drops the first metadata entry.
  5. Finally, ZipFileReader::Compressed(Box<Crc32Reader<Decompressor<io::BufReader<CryptoReader<'a>>>>>) always goes through the logic for CryptoReader, when encrypted zip files are an extremely niche use case. We shouldn't be cluttering up our type definitions or our stack traces with encryption logic when most users are never going to touch it!
        let final_reader = if data.encrypted {
            let crypto_variant = CryptoKeyValidationSource::from_data(data)?;
            let is_ae2_encrypted = crypto_variant.is_ae2_encrypted();
            let crypto_reader = crypto_variant.make_crypto_reader(password, content)?;
            let entry_reader =
                construct_decompressing_reader(&data.compression_method, crypto_reader)?;
            if is_ae2_encrypted {
                /* Ae2 voids crc checking: https://www.winzip.com/en/support/aes-encryption/ */
                CryptoEntryReader::Ae2Encrypted(entry_reader)
            } else {
                CryptoEntryReader::NonAe2Encrypted(NewCrc32Reader::new(entry_reader, data.crc32))
            }
        } else {
            /* Not encrypted, so do the same as in .by_index_generic(): */
            let entry_reader = construct_decompressing_reader(&data.compression_method, content)?;
            CryptoEntryReader::Unencrypted(NewCrc32Reader::new(entry_reader, data.crc32))
        };

So that's my case for the API changes. We can do them in pieces, but I'm proposing this intentionally breaking change that I believe will improve the internal architecture and correctness, increase our ability to perform optimizations, and generate more flexible entry structs in our public API.

@cosmicexplorer
Copy link
Contributor Author

I'm experiencing an incredibly strange issue on all web browsers where pushing to this branch isn't being propagated to this PR, and comparing https://github.com/zip-rs/zip2/compare/master...cosmicexplorer:zip:wrapped-reader-refactor?expand=1 acts like they're the same branch. I'm going to close this and create a new PR with the above proposal and see if that fixes this weird issue.

@cosmicexplorer cosmicexplorer mentioned this pull request Aug 11, 2024
2 tasks
@cosmicexplorer
Copy link
Contributor Author

Moved to #233 which fixes the weird github error in this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
major This >~1000 SLOC PR is likely to cause hard-to-resolve conflicts with other open PRs once merged. Please address review comments Some review comments are still open. rust Pull requests that update Rust code
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants